Using landscape genomics to delineate seed and breeding zones for lodgepole pine

Summary Seed and breeding zones traditionally are delineated based on local adaptation of phenotypic traits associated with climate variables, an approach requiring long‐term field experiments. In this study, we applied a landscape genomics approach to delineate seed and breeding zones for lodgepole pine. We used a gradient forest (GF) model to select environment‐associated single nucleotide polymorphisms (SNPs) using three SNP datasets (full, neutral and candidate) and 20 climate variables for 1906 lodgepole pine (Pinus contorta) individuals in British Columbia and Alberta, Canada. The two GF models built with the full (28 954) and candidate (982) SNPs were compared. The GF models identified winter‐related climate as major climatic factors driving genomic patterns of lodgepole pine’s local adaptation. Based on the genomic gradients predicted by the full and candidate GF models, lodgepole pine distribution range in British Columbia and Alberta was delineated into six seed and breeding zones. Our approach is a novel and effective alternative to traditional common garden approaches for delineating seed and breeding zone, and could be applied to tree species lacking data from provenance trials or common garden experiments.


Fig. S1
Split density graph of all 20 climate variables involved in the full gradient forest model.

Table S1
Predictor importance for 20 climate variables generated by the full and the candidate gradient forest model.

Table S2
Within-cluster variation and reduction in within-cluster variation when a different number of clusters was applied to the full and the candidate model predicted continuous genomic variation for lodgepole pine.

Table S3
Backward comparison between the two sets of common-garden based seed and breeding zones delineated by Liepe et al. (2016;nine zones) and by Ukrainetz et al. (2018;four zones) with the corresponding nine and four seed and breeding zones delineated based on the full and the candidate models for lodgepole pine in western Canada (see Fig. S7  Methods S1 Description of the gradient forest model fitting process.

Fig. S1
Split density graph of all 20 climate variables involved in the full gradient forest model. See Table 1 for descriptions of climate variable abbreviations. Grey bars show the binned raw importance density generated by the random forest output. The black line shows raw importance density I(x). The red line indicates the density of data d(x). The blue line is the estimated importance f(x) = I(x)/d(x). The dashed horizontal line marks where the f(x) ratio is at one (Ellis et al., 2012).

Fig. S2
Split density graph of all 20 climate variables involved in the candidate gradient forest model. See Table 1 for descriptions of climate variable abbreviations. Grey bars show the binned raw importance density generated by the random forest output. The black line shows raw importance density I(x). The red line indicates the density of data d(x). The blue line is the estimated importance f(x) = I(x)/d(x). The dashed horizontal line marks where the f(x) ratio is at one (Ellis et al., 2012).

Fig. S3
Cumulative importance curves for all 20 climate variables involved in the full gradient forest model. See Table 1 for descriptions of climate variable abbreviations.

Fig. S4
Cumulative importance curves for all 20 climate variables involved in the candidate gradient forest model. See Table 1 for descriptions of climate variable abbreviations.

Fig. S5 Principal components analysis biplot of the full (a) and the candidate (b) gradient forest (GF)-predicted genomic variation for
British Columbia and Alberta, Canada. Each point in the biplot represents the principal component (PC) score for locations in the lodgepole pine distribution range in British Columbia and Alberta, Canada. See Table 1 for descriptions of climate variable abbreviations.

Fig. S6
Elevation plot and regional names for British Columbia and Alberta, Canada.       Wang et al., 2012).
Gradient forest (GF) is based on an aggregation of random forests. It builds a random forest for each of the SNPs with the goodness-of-fit for each SNP being presented as R 2 . Only SNPs with R 2 values above zero were considered to be associated with the predictors (having a positive predictive power) and were employed to build the final GF model. For each of the GF-selected SNPs, splits in every regression tree of the random forest are dependent on the smallest impurity (sum of the squared deviation of group mean). The importance of each split, representing the amount of variation explained by that split, was recorded as the impurity importance (also known as raw importance). The impurity importance for each SNP along each predictor gradient generated an impurity importance bar graph, which was then placed in order in an array. This array contained information for all GF-selected SNPs and all predictors included in model fitting.
The impurity importance was then aggregated and averaged among all GF-selected SNPs for each predictor to provide the information for the overall predictor importance.
Raw importance gathered across all GF-selected SNPs for each predictor was then combined into the split importance graph using kernel density estimation. The split importance graph, including the information on binned raw importance and a kernel density estimation of the binned bars, was named the raw importance density. The raw importance density curve was then scaled by data density, resulting in the estimated importance curve, which was further normalized to ensure the area under the estimated importance curve for each predictor sums up to the predictor importance. Finally, cumulative importance curves for each predictor were generated by taking the integral of the estimated importance. Information from the cumulative importance curves converted environmental gradients into genomic gradients in multidimensional space, which serve as the basis for the model prediction.